Optimal parameters for bloom-filtered joins in Spark
Abstract
Star and snowflake database schemas use one large fact table and several small dimension tables. Queries over such schemas often filter on the dimension tables, so they must process many records even when the query result is small. Our work is not limited to tables of this kind. In this paper we assume that there are only two tables, one of which is larger than the other. One table is sufficiently small (the paper makes the notion of "sufficiently" precise). The other table is referred to as the large table. Both tables are distributed and reside on the same cluster. The goal of this study is to execute the following query:
Similar resources
Lightning Fast and Space Efficient Inequality Joins
Inequality joins, which join relational tables on inequality conditions, are used in various applications. While there have been a wide range of optimization methods for joins in database systems, from algorithms such as sort-merge join and band join, to various indices such as B-tree, R*-tree and Bitmap, inequality joins have received little attention and queries containing such joins are usua...
The STARK Framework for Spatio-Temporal Data Analytics on Spark
Big Data sets can contain all types of information: from server log files to tracking information of mobile users with their location at a point in time. Apache Spark has been widely accepted for Big Data analytics because of its very fast processing model. However, Spark has no native support for spatial or spatio-temporal data. Spatial filters or joins using, e.g., a contains predicate are no...
Bloom Filters in Distributed Query Execution
The MapReduce framework [5] has emerged as a successful parallel computation model in large-scale data analytics, mostly due to its simple interface and its scalability over thousands of nodes. However, while various primitives, such as aggregations, are performed efficiently in this framework, more complicated relational algebra operations such as joins and multiway joins are still implemented...
Efficient Skew Handling for Outer Joins in a Cloud Computing Environment
Outer joins are ubiquitous in many workloads and Big Data systems. The question of how to best execute outer joins in large parallel systems is particularly challenging, as real world datasets are characterized by data skew leading to performance issues. Although skew handling techniques have been extensively studied for inner joins, there is little published work solving the corresponding prob...
Cuttlefish: A Lightweight Primitive for Adaptive Query Processing
Modern data processing applications execute increasingly sophisticated analysis that requires operations beyond traditional relational algebra. As a result, operators in query plans grow in diversity and complexity. Designing query optimizer rules and cost models to choose physical operators for all of these novel logical operators is impractical. To address this challenge, we develop Cuttlefis...
Journal title:
- CoRR
Volume abs/1706.02785
Publication date 2017